Identifying personal genomes by surname inference.

Authors

  • Melissa Gymrek
  • Amy L McGuire
  • David Golan
  • Eran Halperin
  • Yaniv Erlich
Abstract

Sharing sequencing data sets without identifiers has become a common practice in genomics. Here, we report that surnames can be recovered from personal genomes by profiling short tandem repeats on the Y chromosome (Y-STRs) and querying recreational genetic genealogy databases. We show that a combination of a surname with other types of metadata, such as age and state, can be used to triangulate the identity of the target. A key feature of this technique is that it entirely relies on free, publicly accessible Internet resources. We quantitatively analyze the probability of identification for U.S. males. We further demonstrate the feasibility of this technique by tracing back with high probability the identities of multiple participants in public sequencing projects.

Similar resources

Excavating past population structures by surname-based sampling: the genetic legacy of the Vikings in northwest England.

The genetic structures of past human populations are obscured by recent migrations and expansions and have been observed only indirectly by inference from modern samples. However, the unique link between a heritable cultural marker, the patrilineal surname, and a genetic marker, the Y chromosome, provides a means to target sets of modern individuals that might resemble populations at the time o...

Variation in Hispanic Self-Identification, Spanish Surname, and Geocoding: Implications for Ethnicity Data Collection

This study examines the variation in surname analysis and geocoding, and their association with self-identified Hispanics in an HMO. We collected ethnicity data from three studies, and employed Spanish surname software and census tract level geocoding to create proxies for Hispanic ethnicity. We computed sensitivity, specificity, and estimated multivariate logistic regression models to examine ...

Assessing record linkage between health care and Vital Statistics databases using deterministic methods

BACKGROUND We assessed the linkage and correct linkage rate using deterministic record linkage among three commonly used Canadian databases, namely, the population registry, hospital discharge data and Vital Statistics registry. METHODS Three combinations of four personal identifiers (surname, first name, sex and date of birth) were used to determine the optimal combination. The correct linka...

Evolutionary inference across eukaryotes identifies multiple pressures favoring mitochondrial gene retention

Since their endosymbiotic origin, mitochondria have lost most of their genes. Although many selective mechanisms underlying the evolution of mitochondrial genomes have been proposed, a data-driven exploration of these hypotheses is lacking, and a quantitatively supported consensus remains absent. We developed HyperTraPS, a methodology coupling stochastic modelling with Bayesian inference, to id...

Recruiting Hispanic women for a population-based study: validity of surname search and characteristics of nonparticipants.

Conducting research on the health of Hispanic populations in the United States entails challenges of identifying individuals who are Hispanic and obtaining good study participation. In this report, identification of Hispanics using a surname search and ethnicity information collected by cancer registries was validated, compared with self-report, for breast cancer cases and controls in Utah and ...

Journal:
  • Science

Volume 339, Issue 6117

Pages: -

Publication date: 2013